Fix async checkpoint timing in DCP recipe#3688

Open
patrocinio wants to merge 1 commit into pytorch:main from patrocinio:3584_Asynchronous_Saving

patrocinio (Contributor) commented Dec 8, 2025

Move checkpoint_future.result() to just before optimizer.step(), so that the previous asynchronous checkpoint is guaranteed to finish before the weights are modified in place. Waiting only at this point lets checkpointing overlap with the forward and backward passes.
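
The ordering described above can be illustrated with a small standard-library simulation. This is only a sketch under stated assumptions: `slow_save` and `train` are hypothetical names, the thread pool stands in for an asynchronous checkpoint writer that reads the live weights, and none of this is DCP's actual API or the recipe's code. The `wait_before_step=False` branch is an illustrative contrast, not a quote of the recipe's previous ordering.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_save(weights, saved):
    """Hypothetical stand-in for an async checkpoint write.

    It reads the *live* weights after a short delay, so any in-place
    update that happens in the meantime leaks into the snapshot.
    """
    time.sleep(0.05)
    saved.append(list(weights))


def train(steps, wait_before_step):
    """Toy training loop; returns (snapshots written, snapshots intended)."""
    weights = [0.0, 0.0]
    saved, expected = [], []
    future = None
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(steps):
            # Forward/backward would run here; a pending save may still be
            # in flight, which is the overlap the description refers to.
            grads = [1.0, 1.0]

            if wait_before_step and future is not None:
                # Previous checkpoint must finish before weights change.
                future.result()

            # Stand-in for optimizer.step(): update the weights in place.
            for i, g in enumerate(grads):
                weights[i] -= 0.1 * g

            if not wait_before_step and future is not None:
                # Contrast case (assumption): waiting only after the update.
                future.result()

            expected.append(list(weights))
            future = pool.submit(slow_save, weights, saved)

        future.result()  # drain the last checkpoint before exiting
    return saved, expected


if __name__ == "__main__":
    saved, expected = train(3, wait_before_step=True)
    print("snapshots match intended state:", saved == expected)
```

With `wait_before_step=True`, each snapshot is written before the next in-place update, so it matches the state at the time the save was requested. With `wait_before_step=False`, the simulated writer will typically observe weights from a later step, which is the hazard the wait is meant to rule out.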

Fixes #3584

Description

Checklist

  • The issue being fixed is referenced in the description (see above "Fixes #ISSUE_NUMBER")
  • Only one issue is addressed in this pull request
  • Labels from the issue that this PR fixes are added to this pull request
  • No unnecessary issues are included in this pull request

cc @wconstab @osalpekar @H-Huang @kwen2501


pytorch-bot bot commented Dec 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/3688

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f914e75 with merge base 7f8b6dc:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


github-actions bot commented Feb 8, 2026

Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark it as stale.
Feel free to remove the stale label if you feel this was a mistake.
If you are unable to remove the stale label, please contact a maintainer to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

github-actions bot added the stale label Feb 8, 2026

Development

Successfully merging this pull request may close these issues.

Feedback about Asynchronous Saving with Distributed Checkpoint (DCP)
